Visualizing & Summarizing Numerical Data

STAT 313

Data Visualizations with ggplot2

What are the aesthetics in this plot?

What geometric object is being plotted?

Univariate (One Variable) Visualizations – For Numerical Data

  • Histogram
  • Boxplot
  • Density Plot

Histogram

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(data = penguins, mapping = aes(x = bill_length_mm)) + 
  geom_histogram() +
  labs(x = "Bill Length (mm)")

Is this aesthetic global or local?
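
To compare the two placements, here is a minimal sketch of the same histogram with the aesthetic set globally and then locally. It assumes the penguins data come from the palmerpenguins package; the names global_plot and local_plot and the choice of bins = 30 are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Global: the mapping is set in ggplot(), so every layer inherits it
global_plot <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(bins = 30)

# Local: the mapping is set inside the geom, so only that layer uses it
local_plot <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm), bins = 30)
```

With a single layer the two plots should look the same; the distinction starts to matter once a plot has several geoms.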

Pros

  • Easy to inspect
  • Higher bars represent where data are relatively more common
  • Inspect shape of a distribution (skewed or symmetric)
  • Identify modes

Cons

  • Do not show the raw data, only summaries (counts) of the data
  • Sensitive to binwidth
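
The sensitivity to binwidth can be seen by drawing the same data twice. This is a sketch only: the binwidths 0.5 and 5 are arbitrary illustrative choices, and the penguins data are assumed to come from the palmerpenguins package.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Narrow bins: many short bars, lots of noise
narrow <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Bill Length (mm)", title = "binwidth = 0.5")

# Wide bins: few tall bars, modes can be hidden
wide <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Bill Length (mm)", title = "binwidth = 5")

print(narrow)
print(wide)
```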

Boxplot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_boxplot() + 
  labs(x = "Bill Length (mm)")

  • What calculations are necessary to create a boxplot?

  • What are strengths of a boxplot?

  • What are weaknesses of a boxplot?
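
As one possible answer to the first question, the sketch below works through the usual boxplot calculations by hand in base R. The bill lengths are made up for illustration, and the 1.5 × IQR rule for flagging outliers is the common convention, assumed here rather than taken from the slides.

```r
# Hypothetical bill lengths (mm), made up for illustration
bill <- c(36, 38, 39, 41, 43, 45, 46, 48, 50, 62)

# Quartiles and median form the box
q1  <- quantile(bill, 0.25, names = FALSE)
med <- median(bill)
q3  <- quantile(bill, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: points beyond them are drawn individually as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- bill[bill < lower_fence | bill > upper_fence]

# Whiskers extend to the most extreme observations inside the fences
whisker_low  <- min(bill[bill >= lower_fence])
whisker_high <- max(bill[bill <= upper_fence])
```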

Density Plot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_density()

  • A smooth approximation to a variable’s distribution
  • Plots density on the y-axis, scaled so the total area under the curve equals 1
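
One way to check the scaling of the y-axis is to estimate a density in base R and add up the area under it. This is a rough sketch on simulated data (not the penguins), using a simple rectangle approximation of the area.

```r
set.seed(313)

# Simulated values, for illustration only
x <- rnorm(200, mean = 44, sd = 5)

# density() returns the smoothed curve on an evenly spaced grid
d <- density(x)

# Approximate the area under the curve: sum of heights times grid spacing
area <- sum(d$y) * (d$x[2] - d$x[1])
area  # should be close to 1
```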

Bivariate (Two Variables) Visualizations – For Numerical Data

  • Scatterplots

  • Faceted Histograms

  • Side-by-Side Boxplots

  • Stacked Density Plots (Ridge Plots)
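
Two of these plots can be sketched with the same building blocks as the univariate plots, by adding species as the second variable. The binwidth of 2 is an arbitrary choice, the penguins data are assumed to come from the palmerpenguins package, and ridge plots are left out because they need an additional package.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Faceted histograms: one histogram of bill length per species
faceted_hist <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ species) +
  labs(x = "Bill Length (mm)")

# Side-by-side boxplots: one boxplot of bill length per species
side_by_side <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm, y = species)) +
  geom_boxplot() +
  labs(x = "Bill Length (mm)", y = "Penguin Species")

print(faceted_hist)
print(side_by_side)
```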

Scatterplots

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm, x = bill_depth_mm)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Multivariate Plots

There are two main methods for adding a third (or fourth) variable into a data visualization:

Colors

  • creates colors for every level of a categorical variable
  • creates a gradient for different values of a quantitative variable

Facets

  • creates subplots for every level of a variable
  • labels each sub-plot with the value of the variable

Colors in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = species)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Penguin Species")

Colors in Scatterplots – Numerical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = body_mass_g)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Body Mass (g)")

Facets in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ species) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Facets in Scatterplots – Numerical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ body_mass_g) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")
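
Faceting on a raw numerical variable produces one subplot per distinct value. A possible workaround, sketched below, is to bin the variable first. The choice of three groups, the names penguins_complete and mass_group, and the use of cut_number() from ggplot2 are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Drop penguins with a missing body mass so every row falls in a bin
penguins_complete <- penguins[!is.na(penguins$body_mass_g), ]

# cut_number() splits a numerical variable into groups of roughly equal size
penguins_complete$mass_group <- cut_number(penguins_complete$body_mass_g, n = 3)

binned_facets <- ggplot(data = penguins_complete,
                        mapping = aes(y = bill_length_mm,
                                      x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ mass_group) +
  labs(x = "Bill Depth (mm)",
       y = "Bill Length (mm)")

print(binned_facets)
```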

Summarizing Numerical Data

Given this distribution…

What measure of center would you use? Why?

Summary Statistics

  • Measures of Center
  • Measures of Spread

Measures of Center

Mean

  • average of data values
  • not resistant to outliers

Median

  • middle observation of the sorted data
  • resistant to outliers
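
Resistance can be illustrated with a small made-up data set: adding one extreme value moves the mean a lot and the median very little. The numbers below are invented for illustration.

```r
# Made-up values, for illustration only
values <- c(2, 3, 4, 5, 6)
with_outlier <- c(values, 100)

mean(values)          # 4
median(values)        # 4

mean(with_outlier)    # 20
median(with_outlier)  # 4.5
```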

For right skewed data…

For symmetric (and bimodal) data…

Measures of Spread

Not Resistant


Variance: (roughly) the average squared distance from the mean


Range: difference between minimum and maximum

Resistant


Interquartile Range (IQR): difference between Q3 and Q1
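
The same kind of made-up data can be used to compare how the three measures of spread react to one extreme value. This is a sketch in base R; note that var() divides by n - 1 rather than n, which is why the variance is only roughly an average.

```r
# Made-up values, for illustration only
values <- c(2, 3, 4, 5, 6)
with_outlier <- c(values, 100)

# Not resistant: both change dramatically when the outlier is added
var_before   <- var(values)
var_after    <- var(with_outlier)
range_before <- diff(range(values))        # maximum minus minimum
range_after  <- diff(range(with_outlier))

# Resistant: changes only slightly when the outlier is added
iqr_before <- IQR(values)                  # Q3 minus Q1
iqr_after  <- IQR(with_outlier)
```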

Point Estimates & Parameters


Parameter: the true value of a numerical summary (such as the mean) for the population of interest


Point Estimate: a statistic computed from a sample, which provides our best guess for the value of the parameter


Estimates based on larger samples tend to be more accurate than those based on smaller samples.
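
A small simulation can illustrate this. The sketch below builds a simulated population (not real data), treats its mean as the parameter, and compares how far sample means from small and large samples land from it on average. The function name sample_mean_error and the sample sizes 10 and 1000 are illustrative choices.

```r
set.seed(313)

# Simulated population, for illustration only
population <- rnorm(10000, mean = 44, sd = 5)
mu <- mean(population)  # the parameter

# Average absolute error of the sample mean across repeated samples of size n
sample_mean_error <- function(n, reps = 1000) {
  errors <- replicate(reps, abs(mean(sample(population, n)) - mu))
  mean(errors)
}

err_small <- sample_mean_error(10)
err_large <- sample_mean_error(1000)

# The larger samples should give the smaller typical error
c(n_10 = err_small, n_1000 = err_large)
```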